Goto

Collaborating Authors

 data generator



Synthcity: a benchmark framework for diverse use cases of tabular synthetic data

Neural Information Processing Systems

Accessible high-quality data is the bread and butter of machine learning research, and the demand for data has exploded as larger and more advanced ML models are built across different domains. Yet, real data often contain sensitive information, are subject to various biases, and are costly to acquire, which compromise their quality and accessibility. Synthetic data have thus emerged as a complement to, sometimes even a replacement for, real data for ML training. However, the landscape of synthetic data research has been fragmented due to the diverse range of data modalities, such as tabular, time series, and images, and the wide array of use cases, including privacy preservation, fairness considerations, and data augmentation. This fragmentation poses practical challenges when comparing and selecting synthetic data generators in for different problem settings. To this end, we develop Synthcity, an open-source Python library that allows researchers and practitioners to perform one-click benchmarking of synthetic data generators across data modalities and use cases. Beyond benchmarking, Synthcity serves as a centralized toolkit for accessing cutting-edge data generators. In addition, Synthcity's flexible plug-in style API makes it easy to incorporate additional data generators into the framework. Using examples of tabular data generation and data augmentation, we illustrate the general applicability of Synthcity, and the insight one can obtain.





Tree-Based Deep Learning for Ranking Symbolic Integration Algorithms

Barket, Rashid, England, Matthew, Gerhard, Jürgen

arXiv.org Artificial Intelligence

Symbolic indefinite integration in Computer Algebra Systems such as Maple involves selecting the most effective algorithm from multiple available methods. Not all methods will succeed for a given problem, and when several do, the results, though mathematically equivalent, can differ greatly in presentation complexity. Traditionally, this choice has been made with minimal consideration of the problem instance, leading to inefficiencies. We present a machine learning (ML) approach using tree-based deep learning models within a two-stage architecture: first identifying applicable methods for a given instance, then ranking them by predicted output complexity. Furthermore, we find representing mathematical expressions as tree structures significantly improves performance over sequence-based representations, and our two-stage framework outperforms alternative ML formulations. Using a diverse dataset generated by six distinct data generators, our models achieve nearly 90% accuracy in selecting the optimal method on a 70,000 example holdout test set. On an independent out-of-distribution benchmark from Maple's internal test suite, our tree transformer model maintains strong generalisation, outperforming Maple's built-in selector and prior ML approaches. These results highlight the critical role of data representation and problem framing in ML for symbolic computation, and we expect our methodology to generalise effectively to similar optimisation problems in mathematical software.


Data Swarms: Optimizable Generation of Synthetic Evaluation Data

Feng, Shangbin, Wang, Yike, Shi, Weijia, Tsvetkov, Yulia

arXiv.org Artificial Intelligence

We propose Data Swarms, an algorithm to optimize the generation of synthetic evaluation data and advance quantitative desiderata of LLM evaluation. We first train a swarm of initial data generators using existing data, and define various evaluation objectives to reflect the desired properties of evaluation (e.g., generate more difficult problems for the evaluated models) and quantitatively evaluate data generators. We then employ particle swarm optimization to optimize the swarm of data generators, where they collaboratively search through the model parameter space to find new generators that advance these objectives. We further extend it to Adversarial Swarms, where the data generator swarm generates harder data while the test taker model swarm learns from such data, co-evolving dynamically for better data and models simultaneously. Extensive experiments demonstrate that Data Swarms outperforms eight data generation baselines across five evaluation objectives, while Adversarial Swarms produce more robust learning of synthetic data and stronger generalization. Further analysis reveals that Data Swarms successfully optimizes compositions of multiple evaluation objectives and generalizes to new off-the-shelf LLMs, unseen at optimization time.


Performative Drift Resistant Classification Using Generative Domain Adversarial Networks

Makowski, Maciej, Gower-Winter, Brandon, Krempl, Georg

arXiv.org Artificial Intelligence

Performative Drift is a special type of Concept Drift that occurs when a model's predictions influence the future instances the model will encounter. In these settings, retraining is not always feasible. In this work, we instead focus on drift understanding as a method for creating drift-resistant classifiers. To achieve this, we introduce the Generative Domain Adversarial Network (GDAN) which combines both Domain and Generative Adversarial Networks. Using GDAN, domain-invariant representations of incoming data are created and a generative network is used to reverse the effects of performative drift. Using semi-real and synthetic data generators, we empirically evaluate GDAN's ability to provide drift-resistant classification. Initial results are promising with GDAN limiting performance degradation over several timesteps. Additionally, GDAN's generative network can be used in tandem with other models to limit their performance degradation in the presence of performative drift. Lastly, we highlight the relationship between model retraining and the unpredictability of performative drift, providing deeper insights into the challenges faced when using traditional Concept Drift mitigation strategies in the performative setting.


Zero-shot Meta-learning for Tabular Prediction Tasks with Adversarially Pre-trained Transformer

Wu, Yulun, Bergman, Doron L.

arXiv.org Artificial Intelligence

We present an Adversarially Pre-trained Transformer (APT) that is able to perform zero-shot meta-learning on tabular prediction tasks without pre-training on any real-world dataset, extending on the recent development of Prior-Data Fitted Networks (PFNs) and TabPFN. Specifically, APT is pre-trained with adversarial synthetic data agents, who continue to shift their underlying data generating distribution and deliberately challenge the model with different synthetic datasets. In addition, we propose a mixture block architecture that is able to handle classification tasks with arbitrary number of classes, addressing the class size limitation -- a crucial weakness of prior deep tabular zero-shot learners. In experiments, we show that our framework matches state-of-the-art performance on small classification tasks without filtering on dataset characteristics such as number of classes and number of missing values, while maintaining an average runtime under one second. On common benchmark dataset suites in both classification and regression, we show that adversarial pre-training was able to enhance TabPFN's performance. In our analysis, we demonstrate that the adversarial synthetic data agents were able to generate a more diverse collection of data compared to the ordinary random generator in TabPFN. In addition, we demonstrate that our mixture block neural design has improved generalizability and greatly accelerated pre-training.


Preference Leakage: A Contamination Problem in LLM-as-a-judge

Li, Dawei, Sun, Renliang, Huang, Yue, Zhong, Ming, Jiang, Bohan, Han, Jiawei, Zhang, Xiangliang, Wang, Wei, Liu, Huan

arXiv.org Artificial Intelligence

Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods in model development. While their combination significantly enhances the efficiency of model training and evaluation, little attention has been given to the potential contamination brought by this new model development paradigm. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators. To study this issue, we first define three common relatednesses between data generator LLM and judge LLM: being the same model, having an inheritance relationship, and belonging to the same model family. Through extensive experiments, we empirically confirm the bias of judges towards their related student models caused by preference leakage across multiple LLM baselines and benchmarks. Further analysis suggests that preference leakage is a pervasive issue that is harder to detect compared to previously identified biases in LLM-as-a-judge scenarios. All of these findings imply that preference leakage is a widespread and challenging problem in the area of LLM-as-a-judge. We release all codes and data at: https://github.com/David-Li0406/Preference-Leakage.